Script (Unicode)

In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems.^[1] Some scripts support one and only one writing system and language, for example, Armenian. Other scripts support many different writing systems. For example, the Latin script supports English, French, German, Italian, Vietnamese and Latin. Some languages have used more than one writing system, and thus more than one script. Turkish, for example, was written in the Arabic script before the 20th century and transitioned to the Latin script in the early part of the 20th century. For a list of languages supported by each script, see the list of languages by writing system.

Scripts are complemented by the Unicode symbols: together, scripts and symbols cover all Unicode characters. The unified diacritical characters and unified punctuation characters frequently have the “common” or “inherited” script property. However, individual scripts often have their own punctuation and diacritics, so many scripts include not only letters but also diacritics and other marks, punctuation, numerals and even their own idiosyncratic symbols and space characters.

Unicode 6.0 includes 26 ancient and historic scripts and 67 modern scripts. Many more are in preparation, as indicated by the Unicode roadmap.

1 Definition and classification
2 Character categories within scripts
3 Table of scripts in Unicode
4 See also
5 References

Definition and classification

When multiple languages use the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a “Swedish O”), while English has no such character, nor does English use the combining ring above diacritic for any character. In general, languages that share a script share many of the same characters. Despite these peripheral differences, the Swedish and English writing systems are said to use the same Latin script. The Unicode abstraction of scripts is thus a basic organizing technique. The differences between alphabets or writing systems remain and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.

Common and inherited scripts

Unicode can assign a character in the UCS to a single script only. However, many characters may be used in more than one script, either because they are not part of a formal natural-language writing system or because they are unified across many writing systems. Examples include currency signs, symbols, numerals and punctuation marks. In these cases Unicode defines them as belonging to the common script (ISO 15924 code "Zyyy"). In total, Unicode has 6,379 characters defined as "Common" script.

In addition, many diacritics and non-spacing combining characters may be applied to characters from more than one script. In these cases Unicode assigns them to the inherited script (ISO 15924 code "Zinh"), which means that they have the same script class as the base character with which they combine, and so in different contexts they may be treated as belonging to different scripts. For example, U+0308 ̈ combining diaeresis may combine with either U+0065 e latin small letter e to create a Latin "ë", or with U+0435 е cyrillic small letter ie to create the Cyrillic "ё". In the former case it inherits the Latin script of the base character, whereas in the latter case it inherits the Cyrillic script of the base character. In Unicode, 523 characters belong to the inherited script.
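
The inheritance rule can be sketched in Python. The standard `unicodedata` module does not expose the script property, so the `SCRIPT_EXCERPT` table below is a hypothetical hand-written excerpt covering only the three characters of the example, and `resolve_scripts` is an illustrative helper rather than part of any library.

```python
import unicodedata

# Hypothetical excerpt of script assignments, limited to the characters
# used in the example; real data lives in the Unicode Character Database.
SCRIPT_EXCERPT = {
    "\u0065": "Latin",      # LATIN SMALL LETTER E
    "\u0435": "Cyrillic",   # CYRILLIC SMALL LETTER IE
    "\u0308": "Inherited",  # COMBINING DIAERESIS
}


def resolve_scripts(text):
    """Return (character name, resolved script) pairs for each character.

    An Inherited character takes the script of the nearest preceding
    character that is not itself Inherited.
    """
    resolved = []
    base_script = "Unknown"  # used when text starts with an Inherited character
    for ch in text:
        script = SCRIPT_EXCERPT.get(ch, "Unknown")
        if script == "Inherited":
            script = base_script
        else:
            base_script = script
        resolved.append((unicodedata.name(ch, "<unnamed>"), script))
    return resolved


latin = resolve_scripts("\u0065\u0308")     # e + combining diaeresis
cyrillic = resolve_scripts("\u0435\u0308")  # Cyrillic ie + combining diaeresis
print(latin)
print(cyrillic)
```

The same combining character resolves to Latin in the first call and to Cyrillic in the second, because only the base character differs.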

Ancient and historic scripts

Ancient and historic scripts in Unicode^[1]

Avestan, Brāhmī, Carian, Coptic, Sumero-Akkadian, Cypriot, Egyptian Hieroglyphs, Glagolitic, Gothic, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Kaithi, Kharoshthi, Linear B, Lycian, Lydian, Ogham, Old Italic, Old Persian, Phags-pa, Phoenician, Old South Arabian, Old Turkic, Runic, Ugaritic
^ Unicode. As of version 5.2 (Brāhmī: 6.0)

As of version 6.0, Unicode includes 26 ancient scripts (out of use for a thousand years or more) and historic scripts (out of use for several hundred years).^[2]

Script versus writing system

Main article: Writing system

"Writing system" is sometimes treated as a synonym for script. However, it can also refer to the specific concrete writing system that a script supports. For example, the Vietnamese writing system is supported by the Latin script. A writing system may also cover more than one script; for example, the Japanese writing system makes use of the Han, Hiragana and Katakana scripts.

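As a rough illustration of one writing system drawing on several scripts, the following Python sketch groups the characters of a Japanese string by script. The standard `unicodedata` module has no script property, so the sketch guesses the script from the first word of each character's Unicode name. This name-prefix heuristic, the `NAME_PREFIX_TO_SCRIPT` table and the `guess_scripts` helper are assumptions made for illustration, not how Unicode defines script membership.

```python
import unicodedata

# Assumed mapping from the first word of a character name to a script label.
NAME_PREFIX_TO_SCRIPT = {
    "CJK": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
}


def guess_scripts(text):
    """Return the set of script labels guessed for the characters in text."""
    scripts = set()
    for ch in text:
        # Characters without a name yield an empty prefix and fall into "Other".
        prefix = unicodedata.name(ch, "").split(" ")[0]
        scripts.add(NAME_PREFIX_TO_SCRIPT.get(prefix, "Other"))
    return scripts


# Three ideographs, one Hiragana letter, four Katakana letters.
sample = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"
print(sorted(guess_scripts(sample)))
```

A production implementation would read the script property from the Unicode Character Database instead of relying on character names.
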
Most writing systems can be broadly divided into several categories: logographic, syllabic, alphabetic (or segmental), abugida, abjad and featural. However, features of any of these may be found in a given writing system in varying proportions, which often makes it difficult to assign a system to a single category. The term complex system is sometimes used to describe those where the admixture makes classification problematic.

Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text processing algorithms.

Character categories within scripts

Unicode provides a general category property for each character, so in addition to belonging to a script, every character also has a general category. Scripts typically include letter characters: uppercase letters, lowercase letters and modifier letters. A few precomposed ligatures, such as ǲ (U+01F2), are categorized as titlecase letters. Such titlecase ligatures are all in the Latin and Greek scripts and are all compatibility characters, so Unicode discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.

Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts, all letters are categorized as “other letter” or “modifier letter”. Ideographs such as Unihan ideographs are also categorized as “other letter”. A few scripts do differentiate between uppercase and lowercase, however, among them Latin, Cyrillic, Greek, Armenian, Georgian and Deseret. Even in these scripts, some letters are neither uppercase nor lowercase.

Scripts can also contain characters of any other general category, such as marks (diacritic and otherwise), numbers (numerals), punctuation, separators (word separators such as spaces), symbols and non-graphical format characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.
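
The general category of a character can be inspected with the `category` function of Python's standard `unicodedata` module, which returns the two-letter category code. The sample characters and the `categories` helper below are chosen for illustration only, and the results reflect the Unicode database bundled with the Python interpreter in use.

```python
import unicodedata

# Sample characters chosen for illustration: letters of several kinds,
# a combining mark, a digit, punctuation, a space and a currency symbol.
SAMPLES = ["A", "a", "\u01f2", "\u02b0", "\u4e2d", "\u0308", "7", ",", " ", "$"]


def categories(chars):
    """Map each character to its two-letter general category code."""
    return {ch: unicodedata.category(ch) for ch in chars}


for ch, cat in categories(SAMPLES).items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

In the output, U+01F2 is reported as a titlecase letter (`Lt`) and the ideograph U+4E2D as an other letter (`Lo`), while the combining diaeresis U+0308 is a non-spacing mark (`Mn`).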

Table of scripts in Unicode

Unicode 6.0 uses 97 script names (called "Alias" or "Property value alias"), based on the ISO 15924 list.^[3] These 97 include 26 ancient or historic scripts; the generic Common script name (Zyyy, "Code for undetermined script") for characters that are used in multiple scripts, such as punctuation; and the general Unknown script name (Zzzz, "Code for uncoded script"). Among the ISO 15924 codes that are not used are Zsym (Symbols) and Zmth (Mathematical notation), which are not considered scripts in the Unicode sense.

ISO 15924 script codes^[a]^[b] and Unicode^[c]^[d]
ISO 15924			script in Unicode^[e]
Code	Nr	Name	Alias^[f]	Direction	Version	Characters	Remark
Afak	439	Afaka					Not in Unicode
Arab	160	Arabic	Arabic	R-to-L	1.0	1,051
Armi	124	Imperial Aramaic	Imperial Aramaic	R-to-L	5.2	31	Ancient/historic
Armn	230	Armenian	Armenian	L-to-R	1.0	90
Avst	134	Avestan	Avestan	R-to-L	5.2	61	Ancient/historic
Bali	360	Balinese	Balinese	L-to-R	5.0	121
Bamu	435	Bamum	Bamum	L-to-R	5.2	657
Bass	259	Bassa Vah			?	(36)	Provisionally accepted for Unicode^[g]
Batk	365	Batak	Batak	L-to-R	6.0	56
Beng	325	Bengali	Bengali	L-to-R	1.0	92
Blis	550	Blissymbols					Not in Unicode
Bopo	285	Bopomofo	Bopomofo	L-to-R	1.0	70
Brah	300	Brahmi	Brahmi	L-to-R	6.0	108	Ancient/historic
Brai	570	Braille	Braille	L-to-R	3.0	256
Bugi	367	Buginese	Buginese	L-to-R	4.1	30
Buhd	372	Buhid	Buhid	L-to-R	3.2	20
Cakm	349	Chakma			6.1?	67?	Included in beta release of Unicode 6.1.0^[h]
Cans	440	Unified Canadian Aboriginal Syllabics	Canadian Aboriginal	L-to-R	3.0	710
Cari	201	Carian	Carian	L-to-R	5.1	49	Ancient/historic
Cham	358	Cham	Cham	L-to-R	5.1	83
Cher	445	Cherokee	Cherokee	L-to-R	3.0	85
Cirt	291	Cirth					Not in Unicode
Copt	204	Coptic	Coptic	L-to-R	1.0	135	(disunified from Greek in 4.1) Ancient/historic
Cprt	403	Cypriot	Cypriot	R-to-L	4.0	55	Ancient/historic
Cyrl	220	Cyrillic	Cyrillic	L-to-R	1.0	408
Cyrs	221	Cyrillic (Old Church Slavonic variant)					Not in Unicode
Deva	315	Devanagari (Nagari)	Devanagari	L-to-R	1.0	150
Dsrt	250	Deseret (Mormon)	Deseret	L-to-R	3.1	80
Dupl	755	Duployan shorthand, Duployan stenography			?	(143)	Provisionally accepted for Unicode^[g]
Egyd	070	Egyptian demotic					Not in Unicode
Egyh	060	Egyptian hieratic					Not in Unicode
Egyp	050	Egyptian hieroglyphs	Egyptian Hieroglyphs	L-to-R	5.2	1,071	Ancient/historic
Elba	226	Elbasan			?	(40)	Provisionally accepted for Unicode^[g]
Ethi	430	Ethiopic (Geʻez)	Ethiopic	L-to-R	3.0	495
Geok	241	Khutsuri (Asomtavruli and Nuskhuri)					Not in Unicode
Geor	240	Georgian (Mkhedruli)	Georgian	L-to-R	1.0	120
Glag	225	Glagolitic	Glagolitic	L-to-R	4.1	94	Ancient/historic
Goth	206	Gothic	Gothic	L-to-R	3.1	27	Ancient/historic
Gran	343	Grantha					Not in Unicode
Grek	200	Greek	Greek	L-to-R	1.0	511
Gujr	320	Gujarati	Gujarati	L-to-R	1.0	83
Guru	310	Gurmukhi	Gurmukhi	L-to-R	1.0	79
Hang	286	Hangul (Hangŭl, Hangeul)	Hangul	L-to-R	1.0	11,739	Hangul syllables relocated in 2.0
Hani	500	Han (Hanzi, Kanji, Hanja)	Han	L-to-R	1.0	75,960
Hano	371	Hanunoo (Hanunóo)	Hanunoo	L-to-R	3.2	21
Hans	501	Han (Simplified variant)					Subset Hani
Hant	502	Han (Traditional variant)					Subset Hani
Hebr	125	Hebrew	Hebrew	R-to-L	1.0	133
Hira	410	Hiragana	Hiragana	L-to-R	1.0	91
Hluw	080	Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)					Not in Unicode
Hmng	450	Pahawh Hmong					Not in Unicode
Hrkt	412	Japanese syllabaries (alias for Hiragana + Katakana)	Katakana or Hiragana				See Hira, Kana
Hung	176	Old Hungarian			?	(109)	Provisionally accepted for Unicode^[g]
Inds	610	Indus (Harappan)					Not in Unicode
Ital	210	Old Italic (Etruscan, Oscan, etc.)	Old Italic	L-to-R	3.1	35	Ancient/historic
Java	361	Javanese	Javanese	L-to-R	5.2	91
Jpan	413	Japanese (alias for Han + Hiragana + Katakana)					See Hani, Hira and Kana
Jurc	510	Jurchen					Not in Unicode
Kali	357	Kayah Li	Kayah Li	L-to-R	5.1	48
Kana	411	Katakana	Katakana	L-to-R	1.0	300
Khar	305	Kharoshthi	Kharoshthi	R-to-L	4.1	65	Ancient/historic
Khmr	355	Khmer	Khmer	L-to-R	3.0	146
Khoj	322	Khojki					Not in Unicode
Knda	345	Kannada	Kannada	L-to-R	1.0	86
Kore	287	Korean (alias for Hangul + Han)					See Hani and Hang
Kpel	436	Kpelle					Not in Unicode
Kthi	317	Kaithi	Kaithi	L-to-R	5.2	66	Ancient/historic
Lana	351	Tai Tham (Lanna)	Tai Tham	L-to-R	5.2	127
Laoo	356	Lao	Lao	L-to-R	1.0	65
Latf	217	Latin (Fraktur variant)		L-to-R			typographic variant of Latin
Latg	216	Latin (Gaelic variant)		L-to-R			typographic variant of Latin
Latn	215	Latin	Latin	L-to-R	1.0	1,267
Lepc	335	Lepcha (Róng)	Lepcha	L-to-R	5.1	74
Limb	336	Limbu	Limbu	L-to-R	4.0	66
Lina	400	Linear A			?	(341)	Provisionally accepted for Unicode^[g]
Linb	401	Linear B	Linear B	L-to-R	4.0	211	Ancient/historic
Lisu	399	Lisu (Fraser)	Lisu	L-to-R	5.2	48
Loma	437	Loma					Not in Unicode
Lyci	202	Lycian	Lycian	L-to-R	5.1	29	Ancient/historic
Lydi	116	Lydian	Lydian	R-to-L	5.1	27	Ancient/historic
Mand	140	Mandaic, Mandaean	Mandaic	R-to-L	6.0	29
Mani	139	Manichaean			?	(51)	Provisionally accepted for Unicode^[g]
Maya	090	Mayan hieroglyphs					Not in Unicode
Mend	438	Mende					Not in Unicode
Merc	101	Meroitic Cursive			6.1?	26?	Included in beta release of Unicode 6.1.0^[h]
Mero	100	Meroitic Hieroglyphs			6.1?	32?	Included in beta release of Unicode 6.1.0^[h]
Mlym	347	Malayalam	Malayalam	L-to-R	1.0	98
Mong	145	Mongolian	Mongolian	L-to-R	3.0	153	Includes Clear, Manchu scripts
Moon	218	Moon (Moon code, Moon script, Moon type)					Not in Unicode
Mroo	199	Mro, Mru			?	(43)	Provisionally accepted for Unicode^[g]
Mtei	337	Meitei Mayek (Meithei, Meetei)	Meetei Mayek	L-to-R	5.2	56
Mymr	350	Myanmar (Burmese)	Myanmar	L-to-R	3.0	188
Narb	106	Old North Arabian (Ancient North Arabian)			?	(32)	Provisionally accepted for Unicode^[g]
Nbat	159	Nabataean			?	(40)	Provisionally accepted for Unicode^[g]
Nkgb	420	Nakhi Geba ('Na-'Khi ²Ggŏ-¹baw, Naxi Geba)					Not in Unicode
Nkoo	165	N’Ko	N'Ko	R-to-L	5.0	59
Nshu	499	Nüshu			?	(389)	Provisionally accepted for Unicode^[g]
Ogam	212	Ogham	Ogham	L-to-R	3.0	29	Ancient/historic
Olck	261	Ol Chiki (Ol Cemet’, Ol, Santali)	Ol Chiki	L-to-R	5.1	48
Orkh	175	Old Turkic, Orkhon Runic	Old Turkic	R-to-L	5.2	73	Ancient/historic
Orya	327	Oriya	Oriya	L-to-R	1.0	90
Osma	260	Osmanya	Osmanya	L-to-R	4.0	40
Palm	126	Palmyrene			?	(32)	Provisionally accepted for Unicode^[g]
Perm	227	Old Permic					Not in Unicode
Phag	331	Phags-pa	Phags-pa	L-to-R	5.0	56	Ancient/historic
Phli	131	Inscriptional Pahlavi	Inscriptional Pahlavi	R-to-L	5.2	27	Ancient/historic
Phlp	132	Psalter Pahlavi					Not in Unicode
Phlv	133	Book Pahlavi					Not in Unicode
Phnx	115	Phoenician	Phoenician	R-to-L	5.0	29	Ancient/historic
Plrd	282	Miao (Pollard)			6.1?	133?	Included in beta release of Unicode 6.1.0^[h]
Prti	130	Inscriptional Parthian	Inscriptional Parthian	R-to-L	5.2	30	Ancient/historic
Qaaa	900	Reserved for private use (start)					Not in Unicode
Qaai	908	(Private use)	Inherited			523	In versions prior to 5.2 (from 5.2: 'Zinh')
Qabx	949	Reserved for private use (end)					Not in Unicode
Rjng	363	Rejang (Redjang, Kaganga)	Rejang	L-to-R	5.1	37
Roro	620	Rongorongo					Not in Unicode
Runr	211	Runic	Runic	L-to-R	3.0	78	Ancient/historic
Samr	123	Samaritan	Samaritan	R-to-L	5.2	61
Sara	292	Sarati					Not in Unicode
Sarb	105	Old South Arabian	Old South Arabian	R-to-L	5.2	32	Ancient/historic
Saur	344	Saurashtra	Saurashtra	L-to-R	5.1	81
Sgnw	095	SignWriting					Not in Unicode
Shaw	281	Shavian (Shaw)	Shavian	L-to-R	4.0	48
Shrd	319	Sharada, Śāradā			6.1?	83?	Included in beta release of Unicode 6.1.0^[h]
Sind	318	Khudawadi, Sindhi					Not in Unicode
Sinh	348	Sinhala	Sinhala	L-to-R	3.0	80
Sora	398	Sora Sompeng			6.1?	35?	Included in beta release of Unicode 6.1.0^[h]
Sund	362	Sundanese	Sundanese	L-to-R	5.1	55
Sylo	316	Syloti Nagri	Syloti Nagri	L-to-R	4.1	44
Syrc	135	Syriac	Syriac	R-to-L	3.0	77
Syre	138	Syriac (Estrangelo variant)					Not in Unicode
Syrj	137	Syriac (Western variant)					Not in Unicode
Syrn	136	Syriac (Eastern variant)					Not in Unicode
Tagb	373	Tagbanwa	Tagbanwa	L-to-R	3.2	18
Takr	321	Takri, Ṭākrī, Ṭāṅkrī			6.1?	66?	Included in beta release of Unicode 6.1.0^[h]
Tale	353	Tai Le	Tai Le	L-to-R	4.0	35
Talu	354	New Tai Lue	New Tai Lue	L-to-R	4.1	83
Taml	346	Tamil	Tamil	L-to-R	1.0	72
Tang	520	Tangut			?	(5,910)	Provisionally accepted for Unicode^[g]
Tavt	359	Tai Viet	Tai Viet	L-to-R	5.2	72
Telu	340	Telugu	Telugu	L-to-R	1.0	93
Teng	290	Tengwar					Not in Unicode
Tfng	120	Tifinagh (Berber)	Tifinagh	L-to-R	4.1	57
Tglg	370	Tagalog (Baybayin, Alibata)	Tagalog	L-to-R	3.2	20
Thaa	170	Thaana	Thaana	R-to-L	3.0	50
Thai	352	Thai	Thai	L-to-R	1.0	86
Tibt	330	Tibetan	Tibetan	L-to-R	1.0	207	(removed in 1.1 and reintroduced in 2.0)
Tirh	326	Tirhuta					Not in Unicode
Ugar	040	Ugaritic	Ugaritic	L-to-R	4.0	31	Ancient/historic
Vaii	470	Vai	Vai	L-to-R	5.1	300
Visp	280	Visible Speech					Not in Unicode
Wara	262	Warang Citi (Varang Kshiti)					Not in Unicode
Wole	480	Woleai					Not in Unicode
Xpeo	030	Old Persian	Old Persian	L-to-R	4.1	50	Ancient/historic
Xsux	020	Cuneiform, Sumero-Akkadian	Cuneiform	L-to-R	5.0	982	Ancient/historic
Yiii	460	Yi	Yi	L-to-R	3.0	1,220
Zinh	994	Code for inherited script	Inherited				From version 5.2 (prior versions: 'Qaai')
Zmth	995	Mathematical notation					Not a 'script' in Unicode
Zsym	996	Symbols					Not a 'script' in Unicode
Zxxx	997	Code for unwritten documents					Not in Unicode
Zyyy	998	Code for undetermined script	Common			6,379
Zzzz	999	Code for uncoded script	Unknown				all other code points
Notes
^[a] ISO 15924 publications (at Unicode.org site), as of 21 June 2011
^[b] ISO 15924 normative text file (Alias names are informal)
^[c] ISO 15924 changes (including Aliases for Unicode)
^[d] As of Unicode version 6.0
^[e] Unicode charts
^[f] Unicode uses the Alias (Property Value Alias) as the script name. These Alias names are part of Unicode and are published informatively next to ISO 15924.

References
